Dataset Description¶
- assembly_session: The session number of the assembly. Sessions run from 1 to 70 and correspond roughly one-to-one with the years 1946–2015.
- state_code: This column contains numerical codes that represent different states. These codes are likely standardized identifiers for each state.
- state_name: This column contains the names of the states corresponding to the state codes. Each state name is associated with its respective state code.
- all_votes: The total number of votes a state cast in a given assembly session, i.e., the sum of its 'yes' votes, 'no' votes, and abstentions.
- yes_votes: The number of 'yes' votes the state cast in that session, counting votes in favor of the motions put to a vote.
- no_votes: The number of 'no' votes the state cast in that session, counting votes against the motions put to a vote.
- abstain: The number of abstentions in that session, counting votes where the state chose neither side.
- idealpoint_estimate: A numerical estimate of the state's ideal point for that session, i.e., its position on a latent policy dimension inferred from its voting record.
- affinityscore_usa: An affinity score measuring how similarly each state votes relative to the United States. A higher score indicates stronger alignment with the United States.
- affinityscore_russia: The same affinity measure relative to Russia; a higher score indicates stronger alignment with Russia.
- affinityscore_china: The same affinity measure relative to China; a higher score indicates stronger alignment with China.
- affinityscore_india: The same affinity measure relative to India; a higher score indicates stronger alignment with India.
- affinityscore_brazil: The same affinity measure relative to Brazil; a higher score indicates stronger alignment with Brazil.
- affinityscore_israel: The same affinity measure relative to Israel; a higher score indicates stronger alignment with Israel.
This dataset captures each state's voting record per assembly session together with its voting affinity toward six reference countries, giving insight into state-level voting dynamics and international alignment over time.
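Given these definitions, a quick consistency check is whether the vote columns add up: yes_votes + no_votes + abstain should not exceed all_votes. A minimal sketch using values taken from the first few rows of the dataset:

```python
import pandas as pd

# Rows mirroring the dataset's vote columns (values from the USA rows shown later)
sample = pd.DataFrame({
    "all_votes": [42, 38, 103],
    "yes_votes": [25, 27, 46],
    "no_votes": [15, 10, 54],
    "abstain": [2, 1, 3],
})

# The three recorded vote types should never exceed the session total
cast = sample[["yes_votes", "no_votes", "abstain"]].sum(axis=1)
consistent = (cast <= sample["all_votes"]).all()
print(consistent)  # True
```

For these rows the three categories in fact sum exactly to all_votes, which supports reading all_votes as the total of the other three.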
Tools and Libraries¶
- Pandas: Used for importing the dataset, performing data manipulation, and conducting exploratory analysis.
- Matplotlib: Employed for creating various types of plots, including histograms, scatter plots, and regression visualizations.
- Seaborn: Enhances the visualization aesthetics and provides additional plotting functions for statistical analysis.
- Scikit-learn: Utilized for building the classification and clustering models (random forest, k-means) applied to the dataset's features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("states.csv")
df.describe(exclude='object')
|  | year | assembly_session | state_code | all_votes | yes_votes | no_votes | abstain | idealpoint_estimate | affinityscore_usa | affinityscore_russia | affinityscore_china | affinityscore_india | affinityscore_brazil | affinityscore_israel | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 |
| mean | 1987.037125 | 42.037125 | 446.914613 | 75.246674 | 59.793235 | 5.647932 | 9.805507 | -0.000279 | 0.293589 | 0.620304 | 0.752558 | 0.687353 | 0.733821 | 0.350540 | 1.061978 |
| std | 18.478671 | 18.478671 | 258.472010 | 33.043167 | 32.657278 | 8.268681 | 9.713783 | 0.989763 | 0.203660 | 0.202982 | 0.160692 | 0.195748 | 0.186575 | 0.189557 | 0.925201 |
| min | 1946.000000 | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | -2.562400 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1973.000000 | 28.000000 | 220.000000 | 58.000000 | 38.000000 | 0.000000 | 4.000000 | -0.661100 | 0.140800 | 0.512200 | 0.723100 | 0.526300 | 0.615400 | 0.193500 | 0.000000 |
| 50% | 1989.000000 | 44.000000 | 437.000000 | 68.000000 | 57.000000 | 2.000000 | 7.000000 | -0.175500 | 0.235300 | 0.652200 | 0.761200 | 0.754100 | 0.800000 | 0.325900 | 1.000000 |
| 75% | 2003.000000 | 58.000000 | 660.000000 | 86.000000 | 71.000000 | 9.000000 | 13.000000 | 0.808900 | 0.388100 | 0.737700 | 0.869600 | 0.835800 | 0.880000 | 0.466700 | 2.000000 |
| max | 2015.000000 | 70.000000 | 990.000000 | 158.000000 | 156.000000 | 98.000000 | 73.000000 | 3.004200 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 |
df.describe(include='object')
|  | state_name |
|---|---|
| count | 9697 |
| unique | 197 |
| top | United States of America |
| freq | 69 |
import pandas as pd
# Check unique country names in the DataFrame
unique_countries = df['state_name'].unique()
print(unique_countries)
['United States of America' 'Canada' 'Bahamas' 'Cuba' 'Haiti' 'Dominican Republic' 'Jamaica' 'Trinidad and Tobago' 'Barbados' 'Dominica' 'Grenada' 'St. Lucia' 'St. Vincent and the Grenadines' 'Antigua & Barbuda' 'St. Kitts and Nevis' 'Mexico' 'Belize' 'Guatemala' 'Honduras' 'El Salvador' 'Nicaragua' 'Costa Rica' 'Panama' 'Colombia' 'Venezuela' 'Guyana' 'Suriname' 'Ecuador' 'Peru' 'Brazil' 'Bolivia' 'Paraguay' 'Chile' 'Argentina' 'Uruguay' 'United Kingdom' 'Ireland' 'Netherlands' 'Belgium' 'Luxembourg' 'France' 'Monaco' 'Liechtenstein' 'Switzerland' 'Spain' 'Andorra' 'Portugal' 'German Federal Republic' 'German Democratic Republic' 'Poland' 'Austria' 'Hungary' 'Czechoslovakia' 'Czech Republic' 'Slovakia' 'Italy' 'San Marino' 'Malta' 'Albania' 'Montenegro' nan 'Macedonia' 'Croatia' 'Yugoslavia' 'Bosnia and Herzegovina' 'Slovenia' 'Greece' 'Cyprus' 'Bulgaria' 'Moldova' 'Romania' 'Russia' 'Estonia' 'Latvia' 'Lithuania' 'Ukraine' 'Belarus' 'Armenia' 'Georgia' 'Azerbaijan' 'Finland' 'Sweden' 'Norway' 'Denmark' 'Iceland' 'Cape Verde' 'Sao Tome and Principe' 'Guinea-Bissau' 'Equatorial Guinea' 'Gambia' 'Mali' 'Senegal' 'Benin' 'Mauritania' 'Niger' 'Ivory Coast' 'Guinea' 'Burkina Faso' 'Liberia' 'Sierra Leone' 'Ghana' 'Togo' 'Cameroon' 'Nigeria' 'Gabon' 'Central African Republic' 'Chad' 'Congo' 'Democratic Republic of the Congo' 'Uganda' 'Kenya' 'Tanzania' 'Burundi' 'Rwanda' 'Somalia' 'Djibouti' 'Ethiopia' 'Eritrea' 'Angola' 'Mozambique' 'Zambia' 'Zimbabwe' 'Malawi' 'South Africa' 'Namibia' 'Lesotho' 'Botswana' 'Swaziland' 'Madagascar' 'Comoros' 'Mauritius' 'Seychelles' 'Morocco' 'Algeria' 'Tunisia' 'Libya' 'Sudan' 'South Sudan' 'Iran' 'Turkey' 'Iraq' 'Egypt' 'Syria' 'Lebanon' 'Jordan' 'Israel' 'Saudi Arabia' 'Yemen Arab Republic' "Yemen People's Republic" 'Kuwait' 'Bahrain' 'Qatar' 'United Arab Emirates' 'Oman' 'Afghanistan' 'Turkmenistan' 'Tajikistan' 'Kyrgyzstan' 'Uzbekistan' 'Kazakhstan' 'China' 'Mongolia' 'Taiwan' 'North Korea' 'South Korea' 'Japan' 'India' 'Bhutan' 
'Pakistan' 'Bangladesh' 'Myanmar' 'Sri Lanka' 'Maldives' 'Nepal' 'Thailand' 'Cambodia' 'Laos' 'Vietnam' 'Malaysia' 'Singapore' 'Brunei' 'Philippines' 'Indonesia' 'East Timor' 'Australia' 'Papua New Guinea' 'New Zealand' 'Vanuatu' 'Solomon Islands' 'Kiribati' 'Tuvalu' 'Fiji' 'Tonga' 'Nauru' 'Marshall Islands' 'Palau' 'Federated States of Micronesia' 'Samoa']
unique_countries_count = df['state_name'].nunique()
print(unique_countries_count)
197
df.head(20)
|  | year | assembly_session | state_code | state_name | all_votes | yes_votes | no_votes | abstain | idealpoint_estimate | affinityscore_usa | affinityscore_russia | affinityscore_china | affinityscore_india | affinityscore_brazil | affinityscore_israel | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1946.0 | 1.0 | 2 | United States of America | 42.0 | 25.0 | 15.0 | 2.0 | 1.7377 | 1.0 | 0.2143 | 0.752558 | 0.4762 | 0.6429 | 0.35054 | 0 |
| 1 | 1947.0 | 2.0 | 2 | United States of America | 38.0 | 27.0 | 10.0 | 1.0 | 1.8417 | 1.0 | 0.2632 | 0.752558 | 0.2973 | 0.8421 | 0.35054 | 0 |
| 2 | 1948.0 | 3.0 | 2 | United States of America | 103.0 | 46.0 | 54.0 | 3.0 | 1.9909 | 1.0 | 0.1275 | 0.752558 | 0.3700 | 0.7767 | 0.16670 | 0 |
| 3 | 1949.0 | 4.0 | 2 | United States of America | 63.0 | 17.0 | 33.0 | 13.0 | 1.9395 | 1.0 | 0.1111 | 0.752558 | 0.3651 | 0.5397 | 0.51610 | 0 |
| 4 | 1950.0 | 5.0 | 2 | United States of America | 53.0 | 26.0 | 25.0 | 2.0 | 1.8651 | 1.0 | 0.1731 | 0.752558 | 0.5094 | 0.8113 | 0.60420 | 0 |
| 5 | 1951.0 | 6.0 | 2 | United States of America | 25.0 | 10.0 | 11.0 | 4.0 | 1.8919 | 1.0 | 0.1200 | 0.752558 | 0.3600 | 0.6400 | 0.65220 | 0 |
| 6 | 1952.0 | 7.0 | 2 | United States of America | 49.0 | 25.0 | 19.0 | 5.0 | 1.9617 | 1.0 | 0.1429 | 0.752558 | 0.3061 | 0.6531 | 0.63270 | 0 |
| 7 | 1953.0 | 8.0 | 2 | United States of America | 25.0 | 12.0 | 6.0 | 7.0 | 1.7707 | 1.0 | 0.2000 | 0.752558 | 0.3333 | 0.6400 | 0.52000 | 0 |
| 8 | 1954.0 | 9.0 | 2 | United States of America | 30.0 | 18.0 | 3.0 | 9.0 | 1.5565 | 1.0 | 0.2000 | 0.752558 | 0.3000 | 0.6333 | 0.43330 | 0 |
| 9 | 1955.0 | 10.0 | 2 | United States of America | 27.0 | 13.0 | 8.0 | 6.0 | 1.8166 | 1.0 | 0.1481 | 0.752558 | 0.1111 | 0.7778 | 0.48150 | 0 |
| 10 | 1956.0 | 11.0 | 2 | United States of America | 60.0 | 44.0 | 11.0 | 5.0 | 1.3449 | 1.0 | 0.2167 | 0.752558 | 0.3667 | 0.9167 | 0.58330 | 0 |
| 11 | 1957.0 | 12.0 | 2 | United States of America | 34.0 | 22.0 | 7.0 | 5.0 | 1.3156 | 1.0 | 0.2353 | 0.752558 | 0.4118 | 0.8235 | 0.61760 | 0 |
| 12 | 1958.0 | 13.0 | 2 | United States of America | 33.0 | 25.0 | 4.0 | 4.0 | 1.3081 | 1.0 | 0.3030 | 0.752558 | 0.4848 | 0.8485 | 0.65630 | 0 |
| 13 | 1959.0 | 14.0 | 2 | United States of America | 54.0 | 23.0 | 17.0 | 14.0 | 1.6179 | 1.0 | 0.1296 | 0.752558 | 0.3396 | 0.7593 | 0.70590 | 0 |
| 14 | 1960.0 | 15.0 | 2 | United States of America | 103.0 | 53.0 | 36.0 | 14.0 | 1.5740 | 1.0 | 0.2330 | 0.752558 | 0.2913 | 0.7184 | 0.67350 | 0 |
| 15 | 1961.0 | 16.0 | 2 | United States of America | 73.0 | 36.0 | 26.0 | 11.0 | 1.7276 | 1.0 | 0.1096 | 0.752558 | 0.2466 | 0.6986 | 0.61640 | 0 |
| 16 | 1962.0 | 17.0 | 2 | United States of America | 46.0 | 28.0 | 13.0 | 5.0 | 1.9215 | 1.0 | 0.1739 | 0.752558 | 0.4889 | 0.6087 | 0.52380 | 0 |
| 17 | 1963.0 | 18.0 | 2 | United States of America | 31.0 | 15.0 | 6.0 | 10.0 | 1.9040 | 1.0 | 0.1290 | 0.752558 | 0.3871 | 0.5484 | 0.50000 | 0 |
| 18 | 1965.0 | 20.0 | 2 | United States of America | 40.0 | 14.0 | 15.0 | 11.0 | 2.0057 | 1.0 | 0.2250 | 0.752558 | 0.3500 | 0.5750 | 0.53850 | 0 |
| 19 | 1966.0 | 21.0 | 2 | United States of America | 50.0 | 19.0 | 20.0 | 11.0 | 2.0598 | 1.0 | 0.1600 | 0.752558 | 0.2200 | 0.6000 | 0.68750 | 0 |
df.tail(20)
|  | year | assembly_session | state_code | state_name | all_votes | yes_votes | no_votes | abstain | idealpoint_estimate | affinityscore_usa | affinityscore_russia | affinityscore_china | affinityscore_india | affinityscore_brazil | affinityscore_israel | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9685 | 1996.0 | 51.0 | 990 | Samoa | 74.0 | 64.0 | 0.0 | 10.0 | 0.2483 | 0.3378 | 0.6575 | 0.6892 | 0.6216 | 0.8630 | 0.2917 | 2 |
| 9686 | 1997.0 | 52.0 | 990 | Samoa | 66.0 | 56.0 | 1.0 | 9.0 | 0.2542 | 0.3485 | 0.6818 | 0.6818 | 0.6212 | 0.8636 | 0.2769 | 2 |
| 9687 | 1998.0 | 53.0 | 990 | Samoa | 57.0 | 49.0 | 0.0 | 8.0 | 0.1540 | 0.2456 | 0.6667 | 0.6429 | 0.6140 | 0.8772 | 0.2281 | 2 |
| 9688 | 1999.0 | 54.0 | 990 | Samoa | 62.0 | 50.0 | 1.0 | 11.0 | 0.2465 | 0.2742 | 0.7258 | 0.5968 | 0.6129 | 0.8548 | 0.2787 | 2 |
| 9689 | 2000.0 | 55.0 | 990 | Samoa | 62.0 | 52.0 | 2.0 | 8.0 | 0.2734 | 0.2742 | 0.6774 | 0.6613 | 0.6129 | 0.8710 | 0.2419 | 2 |
| 9690 | 2001.0 | 56.0 | 990 | Samoa | 37.0 | 30.0 | 3.0 | 4.0 | 0.2417 | 0.3784 | 0.5405 | 0.5405 | 0.5135 | 0.7568 | 0.3784 | 0 |
| 9691 | 2002.0 | 57.0 | 990 | Samoa | 62.0 | 52.0 | 2.0 | 8.0 | 0.1959 | 0.1613 | 0.6167 | 0.6885 | 0.6290 | 0.8387 | 0.1967 | 2 |
| 9692 | 2003.0 | 58.0 | 990 | Samoa | 67.0 | 55.0 | 1.0 | 11.0 | 0.1806 | 0.1493 | 0.7164 | 0.6818 | 0.6269 | 0.7612 | 0.1818 | 2 |
| 9693 | 2004.0 | 59.0 | 990 | Samoa | 65.0 | 51.0 | 2.0 | 12.0 | 0.2033 | 0.1692 | 0.6563 | 0.6406 | 0.5692 | 0.6308 | 0.1563 | 2 |
| 9694 | 2005.0 | 60.0 | 990 | Samoa | 71.0 | 58.0 | 2.0 | 11.0 | 0.1984 | 0.1714 | 0.6429 | 0.7143 | 0.6479 | 0.7746 | 0.2239 | 2 |
| 9695 | 2006.0 | 61.0 | 990 | Samoa | 80.0 | 59.0 | 3.0 | 18.0 | 0.2754 | 0.1875 | 0.6500 | 0.6835 | 0.6076 | 0.7750 | 0.2911 | 2 |
| 9696 | 2007.0 | 62.0 | 990 | Samoa | 66.0 | 55.0 | 1.0 | 10.0 | 0.1086 | 0.0909 | 0.6769 | 0.7846 | 0.6970 | 0.8333 | 0.2121 | 2 |
| 9697 | 2008.0 | 63.0 | 990 | Samoa | 69.0 | 58.0 | 2.0 | 9.0 | 0.1493 | 0.1449 | 0.6232 | 0.6866 | 0.6377 | 0.7971 | 0.2464 | 2 |
| 9698 | 2009.0 | 64.0 | 990 | Samoa | 64.0 | 50.0 | 2.0 | 12.0 | 0.1717 | 0.1719 | 0.6563 | 0.7344 | 0.6406 | 0.7500 | 0.1587 | 2 |
| 9699 | 2010.0 | 65.0 | 990 | Samoa | 65.0 | 53.0 | 2.0 | 10.0 | 0.2148 | 0.2000 | 0.6462 | 0.7188 | 0.6462 | 0.8000 | 0.2031 | 2 |
| 9700 | 2011.0 | 66.0 | 990 | Samoa | 59.0 | 48.0 | 1.0 | 10.0 | 0.2350 | 0.2881 | 0.6102 | 0.6780 | 0.6271 | 0.7458 | 0.2105 | 2 |
| 9701 | 2012.0 | 67.0 | 990 | Samoa | 68.0 | 56.0 | 0.0 | 12.0 | 0.2359 | 0.1912 | 0.6324 | 0.6515 | 0.6618 | 0.7941 | 0.1765 | 2 |
| 9702 | 2013.0 | 68.0 | 990 | Samoa | 62.0 | 51.0 | 0.0 | 11.0 | 0.1735 | 0.1935 | 0.5806 | 0.7097 | 0.6452 | 0.7742 | 0.1500 | 2 |
| 9703 | 2014.0 | 69.0 | 990 | Samoa | 75.0 | 65.0 | 0.0 | 10.0 | 0.1007 | 0.2344 | 0.5714 | 0.6984 | 0.6615 | 0.8154 | 0.2000 | 2 |
| 9704 | 2015.0 | 70.0 | 990 | Samoa | 67.0 | 59.0 | 0.0 | 8.0 | -0.0227 | 0.2090 | 0.5672 | 0.6866 | 0.6567 | 0.8507 | 0.1642 | 2 |
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9705 entries, 0 to 9704
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   year                  9697 non-null   float64
 1   assembly_session      9697 non-null   float64
 2   state_code            9705 non-null   int64
 3   state_name            9697 non-null   object
 4   all_votes             9697 non-null   float64
 5   yes_votes             9697 non-null   float64
 6   no_votes              9697 non-null   float64
 7   abstain               9697 non-null   float64
 8   idealpoint_estimate   9697 non-null   float64
 9   affinityscore_usa     9696 non-null   float64
 10  affinityscore_russia  9692 non-null   float64
 11  affinityscore_china   7608 non-null   float64
 12  affinityscore_india   9696 non-null   float64
 13  affinityscore_brazil  9696 non-null   float64
 14  affinityscore_israel  9585 non-null   float64
dtypes: float64(13), int64(1), object(1)
memory usage: 1.1+ MB
None
df.isnull().sum()
year                       8
assembly_session           8
state_code                 0
state_name                 8
all_votes                  8
yes_votes                  8
no_votes                   8
abstain                    8
idealpoint_estimate        8
affinityscore_usa          9
affinityscore_russia      13
affinityscore_china     2097
affinityscore_india        9
affinityscore_brazil       9
affinityscore_israel     120
dtype: int64
# Drop rows with null values in the 'state_name' column
df = df.dropna(subset=['state_name'])
df.isnull().sum()
year                       0
assembly_session           0
state_code                 0
state_name                 0
all_votes                  0
yes_votes                  0
no_votes                   0
abstain                    0
idealpoint_estimate        0
affinityscore_usa          1
affinityscore_russia       5
affinityscore_china     2089
affinityscore_india        1
affinityscore_brazil       1
affinityscore_israel     112
dtype: int64
sns.heatmap(df.isnull(),yticklabels=False,cmap='flare')
<Axes: >
df.shape
(9697, 15)
# Calculate the mean of each column
mean_affinityscore_usa = df['affinityscore_usa'].mean()
mean_affinityscore_russia = df['affinityscore_russia'].mean()
mean_affinityscore_india = df['affinityscore_india'].mean()
mean_affinityscore_brazil = df['affinityscore_brazil'].mean()
mean_affinityscore_israel = df['affinityscore_israel'].mean()
mean_affinityscore_china = df['affinityscore_china'].mean()
# Replace null values with mean of each column using .loc accessor
df.loc[df['affinityscore_usa'].isnull(), 'affinityscore_usa'] = mean_affinityscore_usa
df.loc[df['affinityscore_russia'].isnull(), 'affinityscore_russia'] = mean_affinityscore_russia
df.loc[df['affinityscore_india'].isnull(), 'affinityscore_india'] = mean_affinityscore_india
df.loc[df['affinityscore_brazil'].isnull(), 'affinityscore_brazil'] = mean_affinityscore_brazil
df.loc[df['affinityscore_israel'].isnull(), 'affinityscore_israel'] = mean_affinityscore_israel
df.loc[df['affinityscore_china'].isnull(), 'affinityscore_china'] = mean_affinityscore_china
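The six column-by-column `.loc` assignments above can be written in one call with `fillna`, which fills every column with its own mean at once. A sketch on a toy frame (the column names follow the dataset; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with missing affinity scores (values invented for illustration)
toy = pd.DataFrame({
    "affinityscore_usa": [0.2, np.nan, 0.4],
    "affinityscore_china": [np.nan, 0.7, 0.9],
})

# fillna with the column means imputes all columns in one call
toy = toy.fillna(toy.mean(numeric_only=True))
print(toy.isnull().sum().sum())  # 0
```

This is behaviorally equivalent to the per-column `.loc` assignments, just more compact and less error-prone when the column list changes.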
import matplotlib.pyplot as plt
# Calculate the mean affinity score for each country
mean_affinity_usa = df['affinityscore_usa'].mean()
mean_affinity_russia = df['affinityscore_russia'].mean()
mean_affinity_china = df['affinityscore_china'].mean()
mean_affinity_india = df['affinityscore_india'].mean()
mean_affinity_brazil = df['affinityscore_brazil'].mean()
mean_affinity_israel = df['affinityscore_israel'].mean()
# Create lists of means and corresponding countries
countries = ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
means = [mean_affinity_usa, mean_affinity_russia, mean_affinity_china, mean_affinity_india, mean_affinity_brazil, mean_affinity_israel]
# Plotting the mean affinity scores
plt.figure(figsize=(10, 6))
plt.bar(countries, means, color=['blue', 'violet', 'red', 'orange', 'purple', 'pink'])
plt.xlabel('Country')
plt.ylabel('Mean Affinity Score')
plt.title('Mean Affinity Score Comparison of Countries')
plt.grid(axis='y')
plt.show()
The graph compares the mean affinity scores of six countries: USA, Russia, China, India, Brazil, and Israel.
The countries on the x-axis appear in the order they were plotted, from USA to Israel. The y-axis shows the mean affinity score on a scale from 0 to about 0.8.
China has the highest mean affinity score (about 0.75), followed by Brazil (about 0.73) and India (about 0.69). The United States (about 0.29) and Israel (about 0.35) have the lowest mean affinity scores.
df.isnull().sum()
year                    0
assembly_session        0
state_code              0
state_name              0
all_votes               0
yes_votes               0
no_votes                0
abstain                 0
idealpoint_estimate     0
affinityscore_usa       0
affinityscore_russia    0
affinityscore_china     0
affinityscore_india     0
affinityscore_brazil    0
affinityscore_israel    0
dtype: int64
df.describe()
|  | year | assembly_session | state_code | all_votes | yes_votes | no_votes | abstain | idealpoint_estimate | affinityscore_usa | affinityscore_russia | affinityscore_china | affinityscore_india | affinityscore_brazil | affinityscore_israel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 | 9697.000000 |
| mean | 1987.037125 | 42.037125 | 446.914613 | 75.246674 | 59.793235 | 5.647932 | 9.805507 | -0.000279 | 0.293589 | 0.620304 | 0.752558 | 0.687353 | 0.733821 | 0.350540 |
| std | 18.478671 | 18.478671 | 258.472010 | 33.043167 | 32.657278 | 8.268681 | 9.713783 | 0.989763 | 0.203660 | 0.202982 | 0.160692 | 0.195748 | 0.186575 | 0.189557 |
| min | 1946.000000 | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | -2.562400 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1973.000000 | 28.000000 | 220.000000 | 58.000000 | 38.000000 | 0.000000 | 4.000000 | -0.661100 | 0.140800 | 0.512200 | 0.723100 | 0.526300 | 0.615400 | 0.193500 |
| 50% | 1989.000000 | 44.000000 | 437.000000 | 68.000000 | 57.000000 | 2.000000 | 7.000000 | -0.175500 | 0.235300 | 0.652200 | 0.761200 | 0.754100 | 0.800000 | 0.325900 |
| 75% | 2003.000000 | 58.000000 | 660.000000 | 86.000000 | 71.000000 | 9.000000 | 13.000000 | 0.808900 | 0.388100 | 0.737700 | 0.869600 | 0.835800 | 0.880000 | 0.466700 |
| max | 2015.000000 | 70.000000 | 990.000000 | 158.000000 | 156.000000 | 98.000000 | 73.000000 | 3.004200 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
df.nunique()
year                      69
assembly_session          69
state_code               198
state_name               197
all_votes                158
yes_votes                157
no_votes                  70
abstain                   71
idealpoint_estimate     8377
affinityscore_usa       2257
affinityscore_russia    2411
affinityscore_china     1687
affinityscore_india     2276
affinityscore_brazil    2120
affinityscore_israel    2023
dtype: int64
import matplotlib.pyplot as plt
# Histograms for each numerical feature
df.hist(bins=20, figsize=(15, 10))
plt.show()
countries = ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
means = [mean_affinity_usa, mean_affinity_russia, mean_affinity_china, mean_affinity_india, mean_affinity_brazil, mean_affinity_israel]
# Plotting the mean affinity scores as a pie chart
plt.figure(figsize=(8, 8))
plt.pie(means, labels=countries, autopct='%1.1f%%', startangle=140)
plt.title('Mean Affinity Score Comparison of Countries')
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
# Assuming you have loaded your dataset into a DataFrame named df
# Let's group the DataFrame by 'state_name' and 'assembly_session' and calculate the sum of votes
grouped_df = df.groupby(['state_name', 'assembly_session']).sum().reset_index()
# Stacked Bar Plot for Yes, No, and Abstain votes
plt.figure(figsize=(20, 15))
for country, data in grouped_df.groupby('state_name'):
plt.bar(data['assembly_session'], data['yes_votes'], label=f'{country} Yes', alpha=0.7)
plt.bar(data['assembly_session'], data['no_votes'], bottom=data['yes_votes'], label=f'{country} No', alpha=0.7)
plt.bar(data['assembly_session'], data['abstain'], bottom=data['yes_votes']+data['no_votes'], label=f'{country} Abstain', alpha=0.7)
plt.xlabel('Assembly Session')
plt.ylabel('Votes')
plt.title('Votes by Country and Assembly Session')
plt.legend()
plt.show()
UserWarning: Creating legend with loc="best" can be slow with large amounts of data.
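One caveat on the groupby above: `.sum()` aggregates every remaining column, including `year`, whose sum is meaningless. Selecting just the vote columns before summing keeps the result tidy. A sketch on invented rows:

```python
import pandas as pd

# Invented rows mimicking the dataset's structure (values are made up)
toy = pd.DataFrame({
    "state_name": ["A", "A", "B"],
    "assembly_session": [1, 2, 1],
    "year": [1946, 1947, 1946],
    "yes_votes": [10, 12, 7],
    "no_votes": [3, 1, 4],
})

# Aggregate only the columns where a sum is meaningful
g = (toy.groupby(["state_name", "assembly_session"])[["yes_votes", "no_votes"]]
        .sum()
        .reset_index())
print(g.columns.tolist())  # ['state_name', 'assembly_session', 'yes_votes', 'no_votes']
```

The same column selection would drop `year` and the affinity scores from `grouped_df` without changing the vote totals plotted above.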
import matplotlib.pyplot as plt
# Assuming you have already defined grouped_df correctly
plt.figure(figsize=(15, 10))
for country, data in grouped_df.groupby('state_name'):
plt.fill_between(data['assembly_session'], data['yes_votes'], label=f'{country} Yes', alpha=0.7)
plt.fill_between(data['assembly_session'], data['no_votes'], label=f'{country} No', alpha=0.7)
plt.fill_between(data['assembly_session'], data['abstain'], label=f'{country} Abstain', alpha=0.7)
plt.xlabel('Assembly Session')
plt.ylabel('Votes')
plt.title('Votes by Country and Assembly Session')
plt.legend()
plt.show()
# Create a pivot table for heatmap
heatmap_df = grouped_df.pivot(index='assembly_session', columns='state_name', values=['yes_votes', 'no_votes', 'abstain'])
# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_df, cmap='viridis', linewidths=0.5)
plt.title('Votes by Country and Assembly Session')
plt.xlabel('Country')
plt.ylabel('Assembly Session')
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
# Assuming your DataFrame is named df
# Filter data for the desired countries
desired_countries = ['Brazil', 'China', 'United States of America', 'Israel', 'Russia', 'India']
filtered_df = df[df['state_name'].isin(desired_countries)]
# Group by assembly session and calculate sum of yes, no, and abstain votes
grouped_df = filtered_df.groupby(['assembly_session', 'state_name']).sum().reset_index()
# Plotting
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(12, 18)) # Adjusted figsize
fig.patch.set_facecolor('#f5f5f5') # Set background color for the figure
fig.patch.set_linewidth(1) # Set border width for the figure
fig.patch.set_edgecolor('black') # Set border color for the figure
# Set background color for the plot area
for ax in axes:
ax.set_facecolor('#e5e5e5') # Set background color for the plot area
ax.spines['top'].set_visible(True) # Show top spine/border
ax.spines['right'].set_visible(True) # Show right spine/border
ax.spines['bottom'].set_linewidth(1) # Set border width for the bottom spine
ax.spines['left'].set_linewidth(1) # Set border width for the left spine
ax.spines['top'].set_linewidth(1) # Set border width for the top spine
ax.spines['right'].set_linewidth(1) # Set border width for the right spine
# Plot for Yes votes
axes[0].set_title('Yes Votes by Country for Each Assembly Session', fontsize=16, fontweight='bold', color='blue')
for country, data in grouped_df.groupby('state_name'):
axes[0].plot(data['assembly_session'], data['yes_votes'], label=country, linewidth=2)
axes[0].set_xlabel('Assembly Session', fontsize=14)
axes[0].set_ylabel('Yes Votes', fontsize=14)
axes[0].legend(fontsize=12)
axes[0].grid(True, linestyle='--', alpha=0.5) # Add grid lines
# Plot for No votes
axes[1].set_title('No Votes by Country for Each Assembly Session', fontsize=16, fontweight='bold', color='red')
for country, data in grouped_df.groupby('state_name'):
axes[1].plot(data['assembly_session'], data['no_votes'], label=country, linewidth=2)
axes[1].set_xlabel('Assembly Session', fontsize=14)
axes[1].set_ylabel('No Votes', fontsize=14)
axes[1].legend(fontsize=12)
axes[1].grid(True, linestyle='--', alpha=0.5) # Add grid lines
# Plot for Abstain votes
axes[2].set_title('Abstain Votes by Country for Each Assembly Session', fontsize=16, fontweight='bold', color='green')
for country, data in grouped_df.groupby('state_name'):
axes[2].plot(data['assembly_session'], data['abstain'], label=country, linewidth=2)
axes[2].set_xlabel('Assembly Session', fontsize=14)
axes[2].set_ylabel('Abstain Votes', fontsize=14)
axes[2].legend(fontsize=12)
axes[2].grid(True, linestyle='--', alpha=0.5) # Add grid lines
plt.tight_layout()
plt.show()
# Calculate the mean affinity score for each country
mean_affinity_usa = df['affinityscore_usa'].mean()
mean_affinity_russia = df['affinityscore_russia'].mean()
mean_affinity_china = df['affinityscore_china'].mean()
mean_affinity_india = df['affinityscore_india'].mean()
mean_affinity_brazil = df['affinityscore_brazil'].mean()
mean_affinity_israel = df['affinityscore_israel'].mean()
# Create lists of means and corresponding countries
countries = ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
means = [mean_affinity_usa, mean_affinity_russia, mean_affinity_china, mean_affinity_india, mean_affinity_brazil, mean_affinity_israel]
# Create the Highcharts HTML code
highcharts_html = """
<!DOCTYPE html>
<html>
<head>
<title>Mean Affinity Score Comparison of Countries</title>
<script src="https://code.highcharts.com/highcharts.js"></script>
</head>
<body>
<div id="container" style="width: 600px; height: 400px; margin: 0 auto"></div>
<script>
Highcharts.chart('container', {{
chart: {{
type: 'bar'
}},
title: {{
text: 'Mean Affinity Score Comparison of Countries'
}},
xAxis: {{
categories: {categories}
}},
yAxis: {{
title: {{
text: 'Mean Affinity Score'
}}
}},
series: [{{
name: 'Mean Affinity Score',
data: {means}
}}]
}});
</script>
</body>
</html>
""".format(categories=countries, means=means)
# Save the HTML output to a file or display it
print(highcharts_html)
<!DOCTYPE html>
<html>
<head>
<title>Mean Affinity Score Comparison of Countries</title>
<script src="https://code.highcharts.com/highcharts.js"></script>
</head>
<body>
<div id="container" style="width: 600px; height: 400px; margin: 0 auto"></div>
<script>
Highcharts.chart('container', {
chart: {
type: 'bar'
},
title: {
text: 'Mean Affinity Score Comparison of Countries'
},
xAxis: {
categories: ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
},
yAxis: {
title: {
text: 'Mean Affinity Score'
}
},
series: [{
name: 'Mean Affinity Score',
data: [0.29358908828382835, 0.620303931077177, 0.7525578338590958, 0.6873531043729373, 0.7338206373762376, 0.3505397496087637]
}]
});
</script>
</body>
</html>
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>Mean Affinity Score Comparison of Countries</title>
<script src="https://code.highcharts.com/highcharts.js"></script>
</head>
<body>
<div id="container" style="width: 600px; height: 400px; margin: 0 auto"></div>
<script>
Highcharts.chart('container', {
chart: {
type: 'bar'
},
title: {
text: 'Mean Affinity Score Comparison of Countries'
},
xAxis: {
categories: ['USA', 'Russia', 'China', 'India', 'Brazil', 'Israel']
},
yAxis: {
title: {
text: 'Mean Affinity Score'
}
},
series: [{
name: 'Mean Affinity Score',
data: [0.29358908828382835, 0.620303931077177, 0.6659521023958656, 0.6873531043729373, 0.7338206373762376, 0.3505397496087637]
}]
});
</script>
</body>
</html>
"""
# Write the HTML content to a file
with open('affinity_score_comparison.html', 'w') as file:
file.write(html_content)
print("HTML file saved successfully!")
HTML file saved successfully!
import pandas as pd
import matplotlib.pyplot as plt
# Assuming 'df' is your DataFrame with columns 'yes_votes', 'no_votes', and 'abstain'
# Step 1: Calculate the sum of each column
sum_yes_votes = df['yes_votes'].sum()
sum_no_votes = df['no_votes'].sum()
sum_abstain = df['abstain'].sum()
# Step 2: Add up the sums of the three columns to get the total
total_votes = sum_yes_votes + sum_no_votes + sum_abstain
# Step 3: Calculate the percentage of each sum
yes_percentage = (sum_yes_votes / total_votes) * 100
no_percentage = (sum_no_votes / total_votes) * 100
abstain_percentage = (sum_abstain / total_votes) * 100
# Step 4: Create a pie chart
labels = ['Yes', 'No', 'Abstain']
sizes = [yes_percentage, no_percentage, abstain_percentage]
colors = ['green', 'red', 'yellow']
explode = (0.1, 0, 0) # explode 1st slice (optional)
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Total Votes Distribution')
plt.show()
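A small note on the pie chart above: `plt.pie` normalizes its input internally, so passing the raw sums would produce the same wedges as the pre-computed percentages; the explicit percentages are still useful for reporting. A sketch with invented counts:

```python
# Invented vote counts standing in for the three sums above
counts = [580, 55, 95]

# plt.pie gives each wedge the fraction counts[i] / sum(counts),
# so raw counts and pre-computed percentages draw identical charts
fracs = [c / sum(counts) for c in counts]
percentages = [100 * f for f in fracs]
print(round(sum(fracs), 6))  # 1.0
```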
# Time Series Analysis
plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='all_votes', data=df)
plt.title('Trend of All Votes over Years')
plt.xlabel('Year')
plt.ylabel('All Votes')
plt.show()
# Correlation Analysis
# correlation_matrix = df.corr()
# plt.figure(figsize=(10, 8))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
# plt.title('Correlation Matrix')
# plt.show()
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
numeric_data = df[numeric_columns]
# Visualize correlation matrix
correlation_matrix = numeric_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
# Distribution Analysis
plt.figure(figsize=(12, 6))
sns.histplot(df['idealpoint_estimate'], kde=True, bins=30)
plt.title('Distribution of Ideal Point Estimate')
plt.xlabel('Ideal Point Estimate')
plt.ylabel('Frequency')
plt.show()
Random Forest Classifier¶
- A random forest classifier builds a set of decision trees, each trained on a randomly selected subset of the training set, and then aggregates the votes of those trees to decide the final prediction.

# import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the data (replace 'data.csv' with your actual file path)
# data = pd.read_csv('data.csv')
# Prepare data
X = df[['affinityscore_usa', 'affinityscore_russia',
'affinityscore_china', 'affinityscore_india',
'affinityscore_brazil', 'affinityscore_israel']]
y = df['yes_votes'].apply(lambda x: 'high' if x > df['yes_votes'].mean() else 'low') # Target variable
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
# Predictions
y_pred = clf.predict(X_test)
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Accuracy: 0.9010309278350516
Classification Report:
              precision    recall  f1-score   support

        high       0.86      0.92      0.89       840
         low       0.94      0.89      0.91      1100

    accuracy                           0.90      1940
   macro avg       0.90      0.90      0.90      1940
weighted avg       0.90      0.90      0.90      1940

[[773  67]
 [125 975]]
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# Generate classification report
report = classification_report(y_test, y_pred, output_dict=True)
# Convert the classification report to a DataFrame
report_df = pd.DataFrame(report).transpose()
# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(report_df.iloc[:-1, :-1], annot=True, cmap="YlGnBu", fmt=".2f")
plt.title('Classification Report Heatmap')
plt.xlabel('Metrics')
plt.ylabel('Class')
plt.show()
K Means¶
- K-means is an unsupervised machine learning algorithm: it operates on unlabeled, unclassified data, with no previous training to supervise it. Given a target number of clusters k, it organizes the unsorted observations by similarity, assigning each point to the cluster with the nearest centroid and iteratively updating the centroids to minimize within-cluster variance.
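A tiny self-contained example of this grouping behavior (synthetic points, not the affinity scores used below): k-means receives no labels, yet assigns the two obvious groups to separate clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of unlabeled 2-D points
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(pts)

# Points in the same group receive the same cluster label
print(km.labels_)
```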

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Prepare data
X = df[['affinityscore_usa', 'affinityscore_russia',
'affinityscore_china', 'affinityscore_india',
'affinityscore_brazil', 'affinityscore_israel']]
# Initialize and fit K-means model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Add cluster labels to the DataFrame
df.loc[:, 'cluster'] = kmeans.labels_  # .loc avoids SettingWithCopyWarning
# Visualize clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='yes_votes', y='no_votes', hue='cluster', data=df, palette='Set1', legend='full')
plt.title('Clustering of States based on Voting Behavior')
plt.xlabel('Yes Votes')
plt.ylabel('No Votes')
plt.show()
from sklearn.metrics import silhouette_score
# Calculate silhouette score
silhouette_avg = silhouette_score(X, kmeans.labels_)
print("Silhouette Score:", silhouette_avg)
Silhouette Score: 0.40464240364072335
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Prepare data
X = df[['affinityscore_usa', 'affinityscore_russia',
'affinityscore_china', 'affinityscore_india',
'affinityscore_brazil', 'affinityscore_israel']]
# Initialize and fit K-means model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Add cluster labels to the DataFrame
df.loc[:, 'cluster'] = kmeans.labels_  # .loc avoids SettingWithCopyWarning
# Visualize clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='affinityscore_india', y='affinityscore_china', hue='cluster', data=df, palette='Set1', legend='full')
plt.title('Clustering of States based on Affinity Scores')
plt.xlabel('Affinity Score - India')
plt.ylabel('Affinity Score - China')
plt.show()
from sklearn.metrics import silhouette_score
# Calculate silhouette score
silhouette_avg = silhouette_score(X, kmeans.labels_)
print("Silhouette Score:", silhouette_avg)
Silhouette Score: 0.40464240364072335
# Group the data by state and calculate the sum of votes for each category
state_votes = df.groupby('state_name')[['all_votes', 'yes_votes', 'no_votes', 'abstain']].sum()
# Plotting bar charts for each voting category
state_votes.plot(kind='bar', stacked=True, figsize=(50, 20))
plt.title('Voting Patterns Across Different States')
plt.xlabel('State')
plt.ylabel('Number of Votes')
plt.xticks(rotation=45) # Rotate state names for better readability
plt.legend(title='Vote Category')
plt.tight_layout() # Prevent labels from being clipped
plt.show()
# Calculate the sum of votes for each category
vote_distribution = df[['all_votes', 'yes_votes', 'no_votes', 'abstain']].sum()
# Plotting bar chart for vote distribution
plt.figure(figsize=(10, 6))
vote_distribution.plot(kind='bar', color=['blue', 'green', 'red', 'orange'])
plt.title('Vote Distribution')
plt.xlabel('Vote Type')
plt.ylabel('Number of Votes')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.show()
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Data preprocessing
# For simplicity, treat 'state_name' as the target variable and the other columns as features
# (note: 'state_code' remains in X and is effectively an encoding of the target)
X = df.drop(columns=['state_name']) # Features
y = df['state_name'] # Target
# Encode categorical variables
le = LabelEncoder()
y = le.fit_transform(y)
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Model evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.5252577319587629
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Data preprocessing
# For simplicity, let's assume 'state_name' is dropped as we are clustering based on voting patterns
X = df.drop(columns=['state_name'])
# Model training
kmeans = KMeans(n_clusters=3, random_state=42) # Specify the number of clusters; fix the seed for reproducibility
kmeans.fit(X)
# Visualizing clusters
plt.scatter(X['no_votes'], X['yes_votes'], c=kmeans.labels_, cmap='viridis')
plt.xlabel('No Votes')
plt.ylabel('Yes Votes')
plt.title('Clustering of States based on Voting Patterns')
plt.show()
Linear Regression¶
- Linear regression predicts the value of one variable from the value of another. The variable being predicted is called the dependent variable; the variable used to make the prediction is called the independent variable.
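A minimal sketch of this relationship on synthetic data (not the affinity scores modeled below): fitting points that lie exactly on the line y = 2x + 1 recovers that slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable x; dependent variable y = 2x + 1 (an exact line)
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)  # slope ~= 2, intercept ~= 1
```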

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Assuming df is your DataFrame containing the data
# Data preprocessing
X = df['affinityscore_china'] # Features
y = df['affinityscore_india'] # Target
# Reshape X and y to be two-dimensional arrays
X = X.values.reshape(-1, 1) # Reshape X to a column vector
y = y.values.reshape(-1, 1) # Reshape y to a column vector
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
# Model evaluation
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Mean Squared Error: 0.015783713134011237
import matplotlib.pyplot as plt
# Plotting the relationship between the independent variable (X) and the dependent variable (y)
plt.figure(figsize=(8, 6))
plt.scatter(X, y, color='blue', alpha=0.5)
plt.title('Scatter Plot of affinityscore_china vs. affinityscore_india')
plt.xlabel('affinityscore_china')
plt.ylabel('affinityscore_india')
plt.grid(True)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
# Assuming 'year' is the time variable and 'all_votes' is the target variable
ts_data = df[['year', 'all_votes']]
ts_data.set_index('year', inplace=True)
# Model training
model = ARIMA(ts_data, order=(5,1,0))
fit_model = model.fit()
# Forecasting
forecast_index = pd.date_range(start='2025', end='2050', freq='YE') # Yearly date range for the forecast horizon ('YE' replaces the deprecated 'Y')
forecast = fit_model.forecast(steps=len(forecast_index)) # Forecasting until 2050
# Plotting forecast
plt.plot(ts_data, label='Actual')
plt.plot(forecast_index, forecast, label='Forecast')
plt.title('Forecasting Future Voting Patterns')
plt.xlabel('Year')
plt.ylabel('Number of Votes')
plt.legend()
plt.show()
# Calculate residuals
residuals = y_test - y_pred
# Plot residuals
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, residuals, color='blue', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals Plot')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()
ARIMA (AutoRegressive Integrated Moving Average)¶
- An autoregressive integrated moving average (ARIMA) model is a statistical analysis model that uses time series data either to better understand the data set or to predict future trends.
- A statistical model is autoregressive if it predicts future values based on past values. For example, an ARIMA model might seek to predict a stock's future prices based on its past performance, or forecast a company's earnings based on past periods.
from sklearn.metrics import mean_squared_error
import numpy as np
# Splitting the dataset into training and testing sets
# (target_data is assumed to be the series modeled above, e.g. target_data = ts_data['all_votes'])
train_size = int(len(target_data) * 0.8) # Using 80% of the data for training
train_data, test_data = target_data.iloc[:train_size], target_data.iloc[train_size:]
# Model training
model = ARIMA(train_data, order=(5, 1, 0))
fit_model = model.fit()
# Forecasting
forecast = fit_model.forecast(steps=len(test_data)) # Forecasting on the testing data
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(test_data, forecast)
# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Squared Error (MSE): 1145.1768315695354
Root Mean Squared Error (RMSE): 33.84046145621444
# Print the forecast
print("Forecast:")
print(forecast)
Forecast:
7757 64.542714
7758 64.554631
7759 65.089764
7760 64.674080
7761 64.790490
...
9692 64.779412
9693 64.779412
9694 64.779412
9695 64.779412
9696 64.779412
Name: predicted_mean, Length: 1940, dtype: float64